Transforming Large Collections of Scientific Publications to XML

Authors

  • Heinrich Stamerjohanns
  • Michael Kohlhase
  • Deyan Ginev
  • Catalin David
  • Bruce R. Miller
Abstract

We describe an experiment transforming large collections of LaTeX documents to more machine-understandable representations. Concretely, we are translating the collection of scientific publications of the Cornell e-Print Archive (arχiv) using LaTeXML, a LaTeX to XML converter currently under development. While the long-term goal is a large body of scientific documents available for semantic analysis, search indexing and other experimentation, the immediate goals are tools for creating such corpora. The first task of our arXMLiv project is to develop LaTeXML bindings for the (thousands of) LaTeX classes and packages used in the arχiv collection, as well as methods for coping with the eccentricities that TeX encourages. We have created a distributed build system that runs LaTeXML over the collection, in part or entirely, while collecting statistics about missing bindings and macros, and other errors. This guides debugging and development efforts, leading to iterative improvements in both the tools and the quality of the converted corpus. The build system thus serves as both a production conversion engine and a software test harness. We have now processed the complete arχiv collection through 2006, consisting of more than 400,000 documents (a complete run is a processor-year-size undertaking), continuously improving our success rate. We are now able to convert more than 90% of these documents to XHTML+MathML. We consider over 60% to be successes, converted with no or minor warnings. While the remaining 30% can also be converted, their quality is doubtful, due to unsupported macros or conversion errors.
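The abstract says the build system collects statistics about missing bindings and macros while it runs. A minimal sketch of that kind of tally follows. It is not the authors' implementation: the `ConversionResult` record, the status labels `"ok"`, `"doubtful"` and `"failed"`, and the `summarize` function are all hypothetical names chosen for illustration, and no real LaTeXML log format is assumed.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List


@dataclass
class ConversionResult:
    """Outcome of converting one document (hypothetical record layout)."""
    doc_id: str
    status: str  # assumed labels: "ok", "doubtful", "failed"
    missing_macros: List[str] = field(default_factory=list)


def summarize(results):
    """Return (share of documents per status, missing macros by document count)."""
    total = len(results)
    status_counts = Counter(r.status for r in results)
    shares = {s: n / total for s, n in status_counts.items()} if total else {}
    # Count each macro once per document, so the ranking reflects how many
    # documents a missing binding affects rather than how often it is used.
    macro_counts = Counter(
        macro for r in results for macro in sorted(set(r.missing_macros))
    )
    return shares, macro_counts.most_common()
```

Ranking missing macros by the number of affected documents is one plausible way to decide which bindings to write first; the abstract does not say how the project actually prioritizes them.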

Similar resources

Transforming the arχiv to XML

We describe an experiment of transforming large collections of LaTeX documents to more machine-understandable representations. Concretely, we are translating the collection of scientific publications of the Cornell e-Print Archive (arXiv) using the LaTeX to XML converter which is currently under development. The main technical task of our arXMLiv project is to supply LaTeXML bindings for the (tho...

A standard TMF modeling for Arabic patents

Patent applications are similarly structured worldwide. They consist of a cover page, a specification, claims, drawings (if necessary) and an abstract. In addition to their content (text, numbers and citations), all patent publications contain a relatively rich set of well-defined metadata. In the Arabic world, there is no North African or Arabian Intellectual Property Office and therefore no u...

Apply Uncertainty in Document-Oriented Database (MongoDB) Using F-XML

As we move to a big-data world in which unstructured data grows at high velocity, a data store is needed to hold this volume of data. Relational databases, the traditional choice, can no longer handle this amount of data, so a move to non-relational data stores is needed. In the current study, we have proposed an extension of the Mongo...

Croatian National Centre for Biobanking – a new perspective in biobanks governance?

Ethical issues in biobanking, as well as the organization and management of biobanks, have become a permanent topic of scientific publications in the last 15 years (1). The Expert Group of the European Commission defines biobanks as collections of various types of biological samples (cells, tissues, blood, DNA) plus related databases. They can be small collections or large national repositories, popu...

Journal:
  • Mathematics in Computer Science

Volume 3 (issue not listed)

Pages: not listed

Publication date: 2010